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Abstract —The Gaussian Filter (GF) is one of the most widely 
used filtering algorithms; instances are the Extended Kalman 
Filter, the Unscented Kalman Filter and the Divided Difference 
Filter. GFs represent the heUef of the current state hy a Gaussian 
with the mean being an affine fnnction of the measnrement. We 
show that this representation can be too restrictive to accurately 
capture the dependences in systems with nonlinear observation 
models, and we Investigate how the GF can be generalized to 
alleviate this problem. To this end, we view the GF from a 
variational-inference perspective. We analyse how restrictions on 
the form of the belief can be relaxed while maintaining simplicity 
and efficiency. This analysis provides a basis for generalizations 
of the GF. We propose one such generalization which coincides 
with a GF using a virtual measurement, obtained by applying 
a nonlinear function to the actual measurement. Numerical 
experiments show that the proposed Feature Gaussian Filter 
(FGF) can have a substantial performance advantage over the 
standard GF for systems with nonlinear observation models. 

I. Introduction 

Decision making requires knowledge of some variables of 
interest. In the vast majority of real-world problems, these 
variables are latent, i.e. they cannot be observed directly and 
must be inferred from available measurements. To maintain an 
up-to-date belief over the latent variables, past measurements 
have to be fused continuously with incoming measurements. 
This process is called filtering and its applications range from 
robotics to estimating a communication signal using noisy 
measurements. 

A. Dynamical Systems Modelling 

Dynamical systems are typically modelled in a state-space 
representation, which means that the state is chosen such 
that the following two statements hold. First, the current 
observation depends only on the current state. Secondly, the 
next state of the system depends only on the current state. 
These assumptions can be visualized by the belief network 
shown in Figure 1 . 

We assume the system to be stationary, i.e. there is no 
explicit dependence on time. Therefore, the absolute time 
indices are irrelevant. Only the time difference within a figure 
or equation is of importance. To simplify notation, we will use 
the indices 1 , 2,3 throughout the paper. 


0 

Figure 1. The belief network which characterizes the evolution of the state 
X and the observations y. 

A stationary system can be characterized by two functions. 
The process model 

X2=g{Xl,V2) ( 1 ) 

describes the evolution of the state. Without loss of generality, 
we can assume the noise V2 to be drawn from a Gaussian with 
zero mean and unit variance, since it can always be mapped 
onto any other distribution inside of the nonlinear function 
g{-). The observation model 

y2 = h{x2,W2) ( 2 ) 

describes how a measurement is produced from the current 
state. Following the same reasoning as above, we assume the 
noise W2 to be Gaussian with zero mean and unit variance. 
The process and observation models can also be represented 
by distributions. The distributional form of both models are 
implied by their functional form 

p{x 2 \xi) = j 6{x2- g{xi,V 2 ))p{v 2 ) ( 3 ) 

V2 

p{y2\x2) = J S{y2 - h{x2,W2))p{w2) ( 4 ) 

W2 

where S is the Dirac delta function. While both representations 
contain the exact same information, sometimes one is more 
convenient than the other. 

B. Exact Filtering 

The desired posterior distribution over the current state 
p{x2\y:2) can be computed recursively from the distribution 
over the previous state p{xi\y:i)\ the subscript (: t) denotes 
all time steps up to t. This recursion can be written in two 












Figure 2. A taxonomy of filtering algorithms. 


steps, a prediction step 


p{x2\y:i) = / p{x2\xi)p[xi\y,i) 


and an update step 

P{.X2\y:2) 


p{y2\x2)p{x2\y-.i) 
j p{y2\x2)p{x2\y-i)' 

X2 


(5) 


( 6 ) 


Since conditional expectation methods suffer from the curse 
of dimensionality, we focus on joint expectation methods in 
this paper. To the best of our knowledge, all such methods 
approximate the true joint distribution p(a; 2 , y 2 | 2 /:i) with a 
Gaussian distribution q{x 2 ,y 2 \y:i) and subsequently condition 
on y 2 , which is easy due to the Gaussian form. This approach 
is called the Gaussian Filter, of which the well known EKF 
[19], the UKF [9] and the Divided Difference Filter (DDF) 
[15] are instances [21, 8]. 

Morelande and Garcia-Fernandez [14] show that for non¬ 
linear dynamical systems, Gaussians can yield a poor fit to 
the true joint distribution p(x 2 , y 2 |?/:i). which in turn leads 
to bad filtering performance. To address this problem, we 
search for a more flexible representation of the belief that can 
accurately capture the dependences in the dynamical system, 
while maintaining the efficiency of the GF. 

In Sections II to IV, we first review existing filtering 
methods, in particular the GF. Then we find some desiderata 
for the form of the approximate belief in Section V to provide 
a basis for efficient generalizations of the GF. In Section VI, 
we propose one possible form of the approximate belief and 
show that this generalization coincides with the GF using a 
virtual measurement given by a nonlinear function of the actual 
measurement. Numerical examples in Section VII highlight 
the potential performance gains of the proposed filter over the 
standard GF. 


Kalman [10] found the solution to these equations for linear 
process and observation models with additive Gaussian noise. 
However, filtering in nonlinear systems remains an important 
area of research. Exact solutions [2, 5] have been found for 
only a very restricted class of process and observation models. 
Eor more general dynamical systems, it is well known that 
the exact posterior distribution cannot be represented by a 
finite number of parameters [11]. Therefore, the need for 
approximations is evident. 

C. Approximate Filtering 

Approximate filtering methods are typically divided into 
deterministic, parametric methods, such as the Unscented 
Kalman Eilter (UKF) [9] and the Extended Kalman Filter 
(EKF) [19], and stochastic, nonparametric methods such as 
the Particle Filter (PF) [7]. In this paper, we argue that there 
is a more fundamental division between filtering methods. 

To the best of our knowledge, all existing filtering al¬ 
gorithms either compute expectations with respect to the 
conditional distribution p{x 2 \y- 2 ) or with respect to the joint 
distribution p{x 2 ,y 2 \y:i)- In Figure 2, we divide approximate 
filtering algorithms according this criterion. The computa¬ 
tional power required to numerically compute expectations 
with respect to p{x 2 \y-. 2 ) increases exponentially with the 
state dimension, limiting the use of such methods to low 
dimensional problems. In contrast, expectations with respect 
to the joint distribution p(x 2 ,y 2 |y:i) can be approximated 
numerically with linear complexity in the state dimension. In 
Section III, we show how this fundamental difference arises. 


II. Approximate Prediction 

We start out with the distribution p{xi\y-i) computed in the 
previous time step. The representation of the beliefs might be 
parametric, such as a Gaussian, or it might be nonparametric, 
e.g. represented by a set of samples. In any case, the goal 
is to find the prediction p{x 2 \y-.i) given the previous belief. 
When there is no closed form solution to (5), we have to settle 
for finding certain properties of the predicted belief p{x 2 \y-i) 
instead of the full distribution. For all filtering algorithms 
we are aware of, these desired properties can be written as 
expectations 

J fix2)pix2\y-.i)- (7) 

For instance with /(X 2 ) = X 2 , we obtain the mean p, and 
with /(X 2 ) = (x 2 — p){x 2 — we obtain the covariance. 
These expectations can then be used to find the parameters 
of an approximate distribution. A widely used approach is 
moment matching, where the moments of the approximate 
distribution are set to the moments of the exact distribution. 
We will analyse such methods in more detail below. What is 
important here is that we are always concerned with finding 
expectations of the form of (7). 

We substitute (5) in (7) in order to write this expectation in 
terms of the last belief and the process model: 

J fix2)pix2\y-.i) = J fix2) j pix2\xi)p{xi\y,i). ( 8 ) 

X2 X2 Xi 



Substituting the distributional process model (3) and solving 
the integral over X 2 , which is easy due to the Dirac distribution 
S, we obtain 



For certain process models g and functions /, it is possible to 
find a closed form solution. In general, however, this integral 
has to be computed numerically. Since p{v 2 ) is the Gaussian 
noise distribution and p{xi\y.,i) is the previous belief in the 
representation of choice, it is generally possible to sample 
from these two distributions. This is crucial since it allows 
for efficient numerical integration. 

One possibility is to use Monte Carlo sampling to approx¬ 
imate the expectation from (9). The standard deviation of the 
estimate is proportional to with L being the number 
of samples. The dimension of the state does not affect the 
standard deviation of the estimate [16]. 

Another possibility is to use deterministic numerical inte¬ 
gration algorithms, such as Gaussian quadrature methods. The 
complexity of such methods typically scales linearly with the 
state dimension [21]. 

Which particular numeric integration method is used to 
compute the approximate expectations is inconsequential for 
the results presented in this paper. What is important is that 
expectations of the type required in the prediction step can 
be approximated efficiently, even for a high dimensional state. 
This is unfortunately not the case for the update step, which 
is the issue we are addressing in this paper. 


III. Approximate Update 

The goal of the update step is to obtain an approximation 
of the posterior p{x 2 \y- 2 ), based on the belief p{x 2 \y:i) which 
has been computed in the prediction step. 


Carlo (SMC) [7, 4], or by applying deterministic methods such 
as Gaussian quadrature [12]. 

There is, however, a very important difference to the pre¬ 
diction step. We now need to compute the expectation of a 
function / weighted with the observation model p{y 2 \x 2 )- If 
these weights are very small at most evaluation points, the 
numeric integration becomes inaccurate, an effect known as 
particle deprivation in particle filters [4]. 

Unfortunately, this effect becomes worse with increasing 
dimensionality. To see this, consider a simple example with 
predictive distribution p(a: 2 |j/:i) = A/'(x 2 | 0 , 1) and observation 
model p{y 2 \x 2 ) = ■I^{y 2 \x 2 ,l)- Both the state and measure¬ 
ment dimensions are equal to D. Computing the expected 
weight, i.e. the expected value of the likelihood, yields 

E[p{y2\x2)]= jp{y2\X2)p{y2\X2)p{.X2\y:l) = 

X2,y2 

That is, the expected weight decreases exponentially with the 
dimension D. In fact, it is well known that the computational 
demands of such methods increase exponentially with the 
state dimensionality [13, 3, 16]. Thus, methods that rely on 
the computation of conditional expectations are restricted to 
dynamical systems which either have a simple structure such 
that expectations can be computed analytically, or are low 
dimensional such that numeric methods can be used. 

B. Computation of Joint Expectations 

There are a number of approaches which avoid computing 
such expectations with respect to the conditional distribution 
p{x 2 \y-. 2 )- Instead, these methods express the parameters of the 
approximate posterior q{x 2 \y- 2 ) as a function of expectations 
with respect to the joint distribution; 

Jf{X2, y2)p{y2, X 2 \y-.l) =Jf{x2,y2)p{y2\X2)p{x2\y:l) ( 13 ) 

X 2 ,y 2 X 2 ,y 2 


A. Computation of Conditional Expectations 

As for the prediction, when there is no exact solution to 
(6), we compute expectations with respect to the posterior 
jx^x{x 2 )p{x 2 \y: 2 ), where r(-) is an arbitrary function. We 
insert (6) to express this expectation in terms of the observation 
model and the predicted distribution; 

j r{x2)p{y2\x2)p{x2\y:l) 

/ 'r{x2)p{x2\y-.2) = """ r , I ^^-■ (10) 

J J P[y2lX2)p(X2ly:l) 

^2 X2 

Both the numerator and the denominator can be written as 



( 11 ) 


with f{x) = r{x) for the numerator and f{x) = 1 for 
the denominator. The update step thus amounts to computing 
expectations of the form of (11). 

As in the prediction step, we can approximate this expec¬ 
tation either by sampling, which is used in Sequential Monte 


Inserting the observation model from (4) into the joint expec¬ 
tation above and solving the integral over j /2 yields 


f{x2,y2)piy2,x2\y-.i) = 


X2,y2 


J f{x2,h{x2,W2))p{w2)p{X2\y-.l). 

3^2,W2 


(14) 


This term has the same form as the expectation in the 
prediction step (9). It is an integral of an arbitrary function 
with respect to probability densities that can be sampled. This 
allows us to approximate this expectation efficiently, even for 
high dimensional states. 


C. Conclusion 

The insight of this section is that computing expecta¬ 
tions numerically with respect to the conditional distribution 
Pix 2 \y: 2 ) requires exponential computational power in the 
state dimension, whereas the complexity of computing expec¬ 
tations with respect to the joint distribution p{x 2 ,y 2 \y:i) scales 






linearly with the state dimension. Note that expectations with 
respect to the marginals p{x 2 \y-i) and p{y 2 \v-.i) are a special 
case of an expectation with respect to the joint distribution 
and can be computed efficiently as well. 

In the remainder of the paper, we only consider the update 
step. Thus, the only variables we require are X 2 and y 2 ', xi 
will not be considered. Therefore, we drop the indices for 
ease of notation. Furthermore, we make the dependence on y-i 
implicit. That is, p{x 2 Ty 2 \y-.i) becomes p{x,y) and p{x 2 \y-. 2 ) 
becomes p{x\y), etc. 


Gaussian joint approximation q{x,y). The posterior (17), 
which we ultimately care about, is therefore Gaussian in the 
state X with the mean being an affine function of y. As 
we show in the experimental section, this form can be too 
restrictive to accurately capture the relationship between the 
measurement and the state in nonlinear settings. This leads to 
information about the state being discarded and ultimately to 
poor filtering performance. 

V. Generalization oe the Gaussian Filter 


IV. The Gaussian Filter 


The advantage in terms of computational complexity of joint 
expectation filters over conditional expectation filters comes 
at a price; The approximate posterior q{x\y) must have a 
functional form such that its parameters can be computed 
efficiently from these joint expectations. To the best of our 
knowledge, all existing joint expectation filters solve this issue 
by approximating the tiTie joint distribution p{x, y) with a 
Gaussian distribution; 


q{x,y) =M 




(15) 


The parameters of this approximation are readily obtained by 
moment matching, i.e. the moments of the Gaussian are set to 
the moments of the exact distribution; 



All of these expectations can be computed efficiently for 
reasons explained in the previous section. 

After the moment matching step, we condition on y to 
obtain the desired posterior, which is a simple operation since 
the approximation is Gaussian; 


y{^\y) — iy yy)T^xx '^xy^yy^xy)- ( 1 "^) 


This approach is called the Gaussian Filter (GF) [8, 18, 21]. 
Widely used filters such as the EKF [19], the UKF [9] and 
the DDF [15] are instances of the Gaussian Filter, differing 
only in the numeric integration method used for computing 
the expectations in (16). 

While much effort has been devoted to finding accurate 
numeric integration schemes for computing these expectations, 
there seems to be no joint expectation method using a non- 


In this section, we investigate whether it is possible to find 
a more general form of the approximate posterior q{x\y) that 
still allows for efficient computation of the parameters. To 
this end, we write the problem of finding the parameters of 
the approximation as an optimization problem. 

In the GF, the parameters 0 of the Gaussian belief q{x, 2/|0) 
are found by moment matching. For a Gaussian approxima¬ 
tion, moment matching is equivalent to minimizing the KL- 
divergence [1] 

KL[p{x,y)\q{x,y\e)] = J log p(a;, y). (18) 

x,V 

By minimizing (18) with respect to 0, we can thus retrieve the 
GF. Furthermore, the KL-divergence has convenient analytic 
properties. It is a widely used objective for matching distribu¬ 
tions and can be justified from an information theoretic point 
of view [1]. 

Having found an appropriate objective for the approxi¬ 
mation, it is natural to ask if it is possible to find more 
general, non-Gaussian approximations. The form of q{x,y\Q) 
is restricted by the requirement of being able to condition 
on y in closed form in order to find the approximate con¬ 
ditional q{x\y, 0). This requirement is met automatically if 
we choose the form of the conditional distribution and the 
marginal distribution separately, instead of picking a form 
for the joint distribution. The joint distribution is then given 
by 9(a;,t/|0) = q{x\y,9)q{y\'d), where we have split the 
parameter set 0 into 9 and i?. Any conditional and marginal 
distributions can be combined to form a valid joint distribu¬ 
tion. Hence, the respective parameter sets 9 and 9 can be 
chosen independently. Imposing any constraints tying the two 
parameter sets together would restrict the flexibility of the joint 
distribution unnecessarily. 

Inserting this factorization into (18), we obtain 


KL[p{x,y)\q{x,y\e)] = c{9)+ KL[p{x,y)\q{x\y,9)] (19) 


where we have collected all terms independent of 9 in c(t?). 
Since only the conditional distribution is of interest, we will 
maximize with respect to 9. Hence, we can drop the terms 
which do not depend on 9, which leads to the objective 
function 



( 20 ) 







Note that this is a somewhat unusual KL-divergence, since it 
compares a joint distribution with a conditional distribution. 
However, this configuration is very desirable in this context. 
We can directly obtain the approximate posterior distribution 
q{x\y^ 6) from the exact joint distribution p{x, y) by minimiz¬ 
ing (20) with respect to 9. Only expectations with respect to 
the joint distribution p{x, y) are required, and we have seen 
that these can be approximated efficiently. 


A. Desiderata for the Form of the Approximation 


In the following, we seek conditions on the form of q{x\y, 9) 
that allow for an efficient minimization of (20) with respect 
to 9. 

First, q{x\y,9) has to integrate to one in x since it is 
a probability distribution. We can enforce this condition by 
writing 


q{x\y^^) 


r{x,y,9) 

J^r{x,y,9) 


( 21 ) 


with r{x,y,9) being any positive function whose integral in 
X over the real domain is finite and non-zero. 

Furthermore, for the objective in (20) to be well defined, the 
support of q{x\y, 9) has to contain the support of p{x, y). Since 
p{x, y) could be any distribution, we will choose the form 
q{x\y, 9) such that it has infinite support; that is, q{x\y, 9) > 0 
everywhere, which implies r(x, y, 9) > 0. This condition is 
enforced by writing the approximate distribution as 


q{x\y,9) 


g/(a:,y,9) 

Jx 


( 22 ) 


with f{x, y, 9) = log(r(a;, y, 9)). The question we will address 
in the following is what / has to look like in order to obtain 
an efficient filtering algorithm. 

Substituting q{x\y, 9) in (20), we obtain 


KL[p{x,y)\q(x\y,9)\ = C+ 

J log ij I p{y) - j f{x,y,9)p{x,y) 


(23) 


where we have collected the terms which do not depend on 9 
in C. By setting the derivative with respect to 9 to zero, we 
obtain a criterion for stationarity 



dfix,y,9) 

89 


q{x\y,9)\p{y) = 


' df{x,y ,9) 
89 


p{x,y). (24) 


V \x 


V,^ 


If we choose /(•) such that the objective (23) is convex in 9, 
then (24) is a sufficient condition for optimality. 

Before this system of equations can be solved, all the 
integrals have to be computed. The integral over x on the 
left-hand side of (24) is an expectation with respect to the 
parametric approximation. Since the integrand depends on un¬ 
known parameters, this inner integral cannot be approximated 


numerically. Therefore, / has to be chosen such that there is 
a closed form solution. 

In general, the outer integral over y cannot be solved in 
closed form since p{y) can have a very complex form, de¬ 
pending on the dynamical system. However, expectations with 
respect to p{y) can be efficiently approximated numerically, 
as discussed above. Numeric integration is possible only if 
the integrand depends on no other variable than the ones 
we integrate out. Therefore, we require / to be such that, 
after analytically solving the inner integral over x, all the 
dependences on 9 can be moved outside of the integral over 

y- 

On the right-hand side of (24), we evaluate an expectation 
with respect to p{x,y). Again, it is not possible for general 
dynamical systems to find a closed form solution, but nu¬ 
merical expectations with respect to p{x, y) can be computed 
efficiently. To allow for numerical integration, / must be such 
that all the dependences on 9 can be moved outside of the 
integral over x and y. 

Finally, after computing the integrals, we have to solve 
the system of equations (24) in order to find the optimal 9. 
Therefore, /(•) should be such that this solution can be found 
efficiently. 

It is not clear how the most general y(a:|y,0) complying 
with the above desiderata can be found. Nevertheless, this 
discussion can guide the search for more general belief repre¬ 
sentations than the affine Gaussian, which leave the efficiency 
of the GF intact. The following section provides an example. 

VI. The Feature Gaussian Filter 

We propose to generalize the affine Gaussian approximate 
posterior of the GF by allowing for nonlinear features (j){y) of 
the measurement. More formally, we choose / in (22) as 

/(a;,y,r,S) = -]^{x - V^{y))'^(x - V(j){y)) (25) 

with parameters 9 = (F, S) and f an arbitrary feature 
function. This leads to an approximate distribution (22), which 
is Gaussian in x but can have nonlinear dependences on y, 

q{x\y, F, E) = N(x\V(l){y), E). (26) 

In the following, we show that because this approximation 
complies with the desiderata from the previous section, the 
parameters can be optimized efficiently. We refer to the 
resulting filtering algorithm as the Feature Gaussian Filter 
(FGF). Finally, we show that the FGF is essentially equivalent 
to the standard GF using a virtual measurement, obtained by 
mapping the actual measurement through a nonlinear function. 

A. Finding F 

The derivative with respect to F is 

^/(""’I^F’E) (27) 










and the corresponding analytic integral can readily be solved 
since the approximate distribution is Gaussian in x\ 



X 


df{x,y,r,T,) 

dr 


q{x\y,r,T,) = 0. 


(28) 


Inserting these results into (24), we can solve for F 


r = E[x(l){y)'^]E[(j}{y)(j){y)'^] ^ 


(29) 


B. Finding S 

The matrix E is constrained to be positive definite, such 
that the approximate distribution (26) is Gaussian. As it turns 
out, the unconstrained optimization yields a positive definite 
matrix. Thus, there is no need to take this constraint into 
account explicitly. 

The derivative with respect to is 


df{x,y,r,E) 


- r</'(y))(a; - T(j){y)f 


and the corresponding analytic integral in x is 



X 


df{x,y,r,E) 

as-i 


g(a:|y,r, E) 



Inserting these results into (24), we can solve for E 
E = E[{x - r(t){y)){x - r(/)(?/))^]. 


(30) 


(31) 


(32) 


C. Connection to the Gaussian Filter 


In the following we show that for a feature (j){y) = 
(c, ip{y)'^)'^, which contains a constant c ^ 0 and an arbitrary 
sub-feature ip, the FGF is equivalent to the GF using y = p{y) 
as the measurement. Inserting (^{y) = {c,y'^)'^ into (29), we 
obtain 


F = 






yy 


(33) 


with the parameters /i(.) and E(.) as defined in (16). The mean 
of the approximate posterior is 


'^(t>{y) = Bx + ^xy^y^{y-By)- (34) 

Inserting this result into (32), we obtain the covariance 

E = E,,-E,gEr.iE^j. (35) 

Clearly, these equations correspond to the GF equations (17). 
This means that, if the feature vector (j){y) contains a constant, 
the FGF is equivalent to the GF using the virtual measurement 
y = (p{y) instead of y. In particular, with a feature (j){y) = 
(1,2/^)^, we retrieve the standard GF. 

Applying nonlinear transformations to the physical sensor 
measurements before feeding them into a GF is not uncommon 
in robotics and other applications (see [6, 20, 17] for example). 
The formal analysis herein provides insight into the effect 
of such nonlinear transformations and reveals why they are 
beneficial. Namely, they allow for a better fit of the conditional 
distribution. While these transformations are often motivated 
from physical insight or introduced heuristically, we provide 
a different interpretation of (/) as a means of improving the 


fit of the posterior by allowing for more expressive nonlinear 
features. This shall be highlighted in the examples in Section 
VII, where we use monomials of increasing order as generic 
features. 

D. Feature Selection 

The above analysis shows that adding nonlinear features 
gives the approximate distribution more flexibility to fit the 
exact distribution. Overfitting is not possible since we are 
minimizing the KL-divergence to the exact distribution. It 
therefore makes sense to use as many features as the com¬ 
putational speed requirements allow. 

Ideally, one would choose a feature which maps the mea¬ 
surement to a representation which relates to the state linearly. 
If this is not possible, then generic features such as monomials 
can be used. 

E. Computational Complexity 

The only cause of a difference in computational complexity 
between the standard GF and the FGF is the difference in 
the dimension of the measurement y and the feature (t>{y). 
This means that the feature dimension has to be chosen such 
that the required computational speed is attained. The feature 
dimension can even be lower than the dimension of the actual 
measurement if the standard GF is too slow. 

VII. Analysis and Simulation of the Feature 
Gaussian Filter 

As the previous analysis suggests, it is beneficial to augment 
the measurement with nonlinear features since this gives the 
approximation more flexibility to fit the exact distribution, i.e. 
to achieve a lower KL-divergence (23). In this section, we 
illustrate this effect in more detail for two dynamical systems. 

A. Estimation of Sensor Noise Magnitude 

The measurement process (2) of a dynamical system can 
often be represented by a nonlinear observation model with 
additive noise 

h{x, M,w) = h{x) + Mw (36) 

where h is a nonlinear function of the system state, and the 
matrix M determines the magnitude of the sensor noise (recall 
that w is Gaussian with zero mean and unit variance). Often, 
the sensor accuracy (i.e. the matrix M) is not precisely known, 
or it may be time varying due to changing sensor properties 
and environmental conditions. It is then desirable to estimate 
the noise matrix M alongside the state x. In the following, we 
show that this is not possible with the standard GF, but can 
be achieved with the FGF. 

We define an augmented state x := (x; to), where m is 
a column vector containing all the elements of the noise 
matrix M. The observation model in distributional form is 
p{y\x) = p{y\x,m) = J\f{y\h{x),MM'^). The state x and 
the parameters to stem from independent processes, and we 
therefore have p{x) = p{x)p{m). Let us now apply the 
standard GF to this problem by computing the parameters in 








(16). In particular, we compute the covariance between the 
augmented state and the measurement 

^xy= j ^iy)^p{y\x,m)p{x)p{nl). (37) 

x,m,y 

The integral over y can be solved easily since p{y\x,m) is 
Gaussian, 

Siy = J CHx) - Py)^p{x)p{m). (38) 

x^m 

Interestingly, the second factor does not depend on m. There¬ 
fore, the integral over m is solved easily and yields 

=/.(mI-I)^C' d”)^ 

As a result, there is no linear correlation between the measure¬ 
ment y and the parameters m. Inserting this result into (17) 
shows that the innovation corresponding to m is zero. The 
corresponding part of the covariance matrix does not change 
either. The measurement has hence no effect on the estimate 
of TO. It will behave as if no observation had been made. This 
illustrates the failure of the GF to capture certain dependences 
in nonlinear dynamical systems. 

In contrast, if a nonlinear feature in the measurement y is 
used, the integral over y in (37) will not yield h{x), but instead 
some function depending on both x and to. This dependence 
allows the FGF to infer the desired parameters. 

Numerical example: For the purpose of illustrating the 
theoretical argument above, we use a small toy example. We 
consider a single sensor, where all quantities in (36), including 
the standard deviation M, are scalars. Since we are only 
interested in the estimate of M, we choose h{x) = 0. The 
observation model (36) simplifies to 

h{M2,W2) = M2W2- (40) 

Note that we have reintroduced time indices. Picking a simple 
process model and an initial distribution 

g{Mi,V2) = Ml + 0.1v2 (41) 

p(Mi) =AA(Mi|5,1) (42) 

the dynamical system (1), (2) is fully defined. This example 
captures the fundamental properties of the FGF as pertaining 
to the estimation of sensor noise intensity M. The same 
qualitative effects hold for multivariate systems (36) for the 
reasons stated above. 

In Figure 3, we plot the exact conditional distribution 
p{M 2 \y 2 ) implied by this system in grayscale. This distri¬ 
bution was computed numerically for the purpose of com¬ 
parison. It would, of course, be too expensive to use in a 
filtering algorithm. The overlaid orange contour lines show the 
approximate conditional distribution q{M 2 \y 2 ) obtained with 
the standard GF. No matter what measurement y 2 is obtained, 
the posterior q{M 2 \y 2 ) is the same. The GF does not react to 
the measurements at all. 
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Figure 3. Estimation of sensor noise magnitude: Density plot of the 
true conditional distribution p(M 2 \y 2 ) with overlaid contour lines of the 
approximate conditional distribution q(M 2 \y 2 ) of the GF in orange and of 
the FGF in blue. 



Figure 4. Estimation of sensor noise magnitude: The simulated noise 
parameter M is shown in black, together with the mean and standard deviation 
of the estimates obtained with the GF (orange) and the FGF (blue). 

The true conditional distribution p{M 2 \y 2 ) depends on 
2 / 2 , which means that the measurement does in fact contain 
information about the state M 2 . However, the approximation 
(/(M 2 I 2 / 2 ) made by the GF is not expressive enough to capture 
this information, which results in a very poor fit to p{M 2 \y 2 )- 

The standard GF is the special case of the FGF with the 
feature (j){y) = Let us take the obvious next step 

and add a quadratic term to the feature (j){y) = (l,//,?/^)^. 
The resulting approximation is represented by the blue con¬ 
tour lines in Figure 3. Clearly, q{x 2 \y 2 ) now depends on 
the measurement 2 / 2 , which allows the FGF to exploit the 
information about the state X 2 contained in the measurement. 
The approximation q{x 2 \y 2 ) of the FGF has a more flexible 
form, which allows for a better fit of the true posterior. 

To analyse actual Altering performance, we simulate the 
dynamical system and the two Alters for 1000 time steps. 
The results are shown in Figure 4. As expected, the standard 
GF does not react in any way to the incoming measurements. 
The FGF, on the other hand, is capable of inferring the state 
M from the measurement y, as suggested by the theoretical 
















- 10 . - 6 . - 2 . 2 . 6 . 10 . 


X 

Figure 5. Nonlinear observation model: Density plot of the true conditional 
distribution p{x 2 \y 2 ) with overlaid contour lines of the approximate condi¬ 
tional distribution q(x 2 \y 2 ) of the GF in orange and of the FGF in blue. 

analysis above. 

B. Nonlinear Observation Model 

In this section, we investigate how the theoretical benefit 
of adding nonlinear features translates into improved filtering 
performance for systems with nonlinear observation models. 
To clearly illustrate the difference of GF and FGF, we choose 
a simple system with a strong nonlinearity (step function). 
Given the theoretical analysis herein, it is to be expected that 
the insights gained from this artificial example extend to more 
realistic nonlinear problems in robotics and other applications. 

The process model, the observation model, and the initial 

state distribution are given by 

g{xi,V2) = X 1 +V 2 (43) 

h{x 2 ,W 2 ) = X 2 + W 2 + ^QH{x 2 ) (44) 

p(a::i) = A/'(a;i|0,5) (45) 

where H{-) is the Heaviside step function. 

In Figure 5, we plot the true conditional density p{x 2 \y 2 ) 
with overlaid orange contour lines of the approximate condi¬ 
tional distribution q{x 2 \y 2 ) obtained using the standard GF. 
The contour lines reflect the estimator structure of the GF 
in (17). The mean of the approximate density q{x 2 \y 2 ) is 
an affine function of the measurement j/ 2 - For nonlinear 
observation models, this coarse approximation can lead to loss 
of valuable information contained in the measurement j/ 2 - 
The approximate density 9 (^ 2 1 2 / 2 ) obtained using a feature 
4>{y) = (Ij 2/: which is represented by the blue 

contour lines in Figure 5, fits the true posterior much better. 
This illustrates that nonlinear features allow for approximate 
posteriors with much more elaborate dependences on y. 

Figure 6 shows how this difference translates to filtering 
performance. When x is far away from zero, the nonlinearity 
has no effect; the system behaves like a linear system. The 
density plot in this regime would be centered at a linear part 
of the distribution, and both filters would achieve a perfect fit. 
Both the standard GF and the FGF are therefore optimal in that 



Figure 6. Nonlinear observation model: We plot the simulated state x (black) 
and the means and standard deviations of the estimates obtained with the GF 
(orange) and the FGF (blue). 


case. When the state is close to zero, however, the advantage 
of the FGF becomes apparent. Its tracking performance is 
good even when the state is close to the nonlinearity of the 
observation model, due to more flexibility in y 2 of the posterior 
approximation q (a; 2 1 2 / 2 ) ■ 


VIII. Conclusion 


We showed that the GF can be understood as an optimal ap¬ 
proximation to the exact distribution, subject to the constraint 
that the form of the belief q{x\y) be Gaussian in x and affine 
in y. Theoretical analysis and simulations showed that this 
form can be too restrictive to accurately represent the belief 
in nonlinear systems. We discussed how this constraint can 
be relaxed while maintaining the efficiency of the GF. This 
analysis served as a basis for potential generalizations of the 
GF. 

We proposed one such generalization, the Feature Gaussian 
Filter (FGF). The name is motivated by the fact that the FGF 
is equivalent to a GF that uses a virtual measurement, or 
feature, which is obtained by applying a nonlinear function to 
the actual measurement. We showed both theoretically and in 
simulation that using nonlinear features can significantly im¬ 
prove the performance of the GF. For instance, the practically 
relevant problem of estimating the sensor noise magnitude 
alongside the state cannot be tackled by the standard GF 
because the expressive power of its belief is too limited. We 
showed that this issue can be resolved by the FGF. 

The results obtained in the simulation examples herein 
are promising and suggest that the FGF may yield superior 
filtering performance for nonlinear problems in robotics and 
other applications. Analysing the performance of the FGF in a 
more realistic, high dimensional scenario remains future work. 

Whether it is possible to find an approximate posterior of a 
more general form than in the FGF, while complying with the 
requirements derived in Section V, is another open question. 
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